This paper studies distribution estimation from contaminated data with the MoM-GAN method, which combines generative adversarial networks (GANs) with median-of-means (MoM) estimation. We model both the generator and the discriminator of the GAN with deep neural networks (DNNs) using the ReLU activation function. Theoretically, we derive a non-asymptotic error bound for the DNN-based MoM-GAN estimator, measured by an integral probability metric defined over a $b$-smooth H\"{o}lder class. The error bound decreases essentially as $n^{-b/p}\vee n^{-1/2}$, where $n$ is the sample size and $p$ is the dimension of the input data. We give an algorithm for the MoM-GAN method and demonstrate it on two real applications. The numerical results show that MoM-GAN outperforms competing methods when dealing with contaminated data.
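As a concrete illustration of the median-of-means building block behind this estimator, the following minimal Python sketch compares a plain empirical mean with a MoM estimate on contaminated data. The function name, block count, and toy data are mine; in MoM-GAN the same blockwise-median idea would be applied to the discriminator objective rather than to a scalar sample.

```python
import numpy as np

def median_of_means(values, n_blocks):
    """Median-of-means estimate of the mean of `values`.

    Split the samples into n_blocks disjoint blocks, average within each
    block, and return the median of the block means.  Gross outliers can
    corrupt only a few blocks, so the median stays stable.
    """
    values = np.asarray(values, dtype=float)
    rng = np.random.default_rng(0)
    perm = rng.permutation(len(values))            # random block assignment
    blocks = np.array_split(values[perm], n_blocks)
    block_means = np.array([b.mean() for b in blocks])
    return np.median(block_means)

# Contaminated sample: true mean is 0, but 5% of the points are gross outliers.
clean = np.random.default_rng(1).normal(0.0, 1.0, size=950)
outliers = np.full(50, 50.0)
data = np.concatenate([clean, outliers])

print("plain mean      :", data.mean())                 # dragged toward 50
print("median-of-means :", median_of_means(data, 25))   # stays close to 0
```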
As the basis for prehensile manipulation, it is vital to enable robots to grasp as robustly as humans do. In daily manipulation, our grasping system is prompt, accurate, flexible, and continuous across the spatial and temporal domains. Few existing methods cover all of these properties for robot grasping. In this paper, we propose a new methodology for grasp perception that equips robots with these abilities. Specifically, we develop a dense supervision strategy with real perception and analytic labels in the spatial-temporal domain. Additional awareness of objects' center of mass is incorporated into the learning process to help improve grasping stability. Utilization of grasp correspondence across observations enables dynamic grasp tracking. Our model, AnyGrasp, can generate accurate, full-DoF, dense, and temporally smooth grasp poses efficiently, and works robustly against large depth-sensing noise. Embedded with AnyGrasp, we achieve a 93.3% success rate when clearing bins with over 300 unseen objects, which is comparable to human subjects under controlled conditions. Over 900 mean picks per hour (MPPH) is reported on a single-arm system. For dynamic grasping, we demonstrate catching swimming robotic fish in water.
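To make the "grasp correspondence across observations" idea concrete, here is a toy Python sketch that associates grasp poses between two consecutive frames by greedy nearest-neighbour matching on a translation-plus-rotation distance. The distance weights, threshold, and data layout are my own assumptions; the paper's actual correspondence mechanism is learned, not hand-crafted like this.

```python
import numpy as np

def grasp_distance(g1, g2, rot_weight=0.1):
    """Toy distance between two grasps, each a dict with a 3-vector
    'center' and a 3x3 rotation matrix 'rotation'."""
    t = np.linalg.norm(g1["center"] - g2["center"])
    # Geodesic angle between the two rotation matrices.
    cos = (np.trace(g1["rotation"].T @ g2["rotation"]) - 1.0) / 2.0
    ang = np.arccos(np.clip(cos, -1.0, 1.0))
    return t + rot_weight * ang

def associate(prev_grasps, curr_grasps, max_dist=0.05):
    """Greedy nearest-neighbour association of current grasps to the
    previous frame, so a grasp keeps its identity while the object moves."""
    matches = {}
    for i, g in enumerate(curr_grasps):
        dists = [grasp_distance(g, p) for p in prev_grasps]
        if dists and min(dists) < max_dist:
            matches[i] = int(np.argmin(dists))
    return matches  # maps current grasp index -> previous grasp index

prev = [{"center": np.zeros(3), "rotation": np.eye(3)}]
curr = [{"center": np.array([0.01, 0.0, 0.0]), "rotation": np.eye(3)}]
print(associate(prev, curr))   # {0: 0}
```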
Scene text spotting is of great importance to the computer vision community due to its wide variety of applications. Recent methods attempt to introduce linguistic knowledge for challenging recognition rather than relying on pure visual classification. However, how to effectively model linguistic rules in end-to-end deep networks remains a research challenge. In this paper, we argue that the limited capacity of language models comes from 1) implicit language modeling; 2) unidirectional feature representation; and 3) a language model with noisy input. Correspondingly, we propose ABINet++, an autonomous, bidirectional, and iterative model for scene text spotting. Firstly, the autonomous design enforces explicit language modeling by decoupling the recognizer into a vision model and a language model and blocking gradient flow between the two. Secondly, we propose a novel bidirectional cloze network (BCN) as the language model, based on bidirectional feature representation. Thirdly, we propose an iterative correction scheme for the language model, which effectively alleviates the impact of noisy input. Finally, to strengthen ABINet++ on long text recognition, we propose aggregating horizontal features by embedding Transformer units inside a U-Net, and we design a position-and-content attention module that integrates character order and content to attend to character features precisely. ABINet++ achieves state-of-the-art performance on both scene text recognition and scene text spotting benchmarks, consistently demonstrating the superiority of our method in various environments, especially on low-quality images. Moreover, extensive experiments on both English and Chinese text show that a text spotter incorporating our language modeling method can significantly improve its performance in both accuracy and speed compared with commonly used attention-based recognizers.
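The following minimal PyTorch sketch illustrates the "autonomous + iterative" recipe described above: a stand-in vision head produces character distributions, a separate language model refines them, the refinement is applied for a fixed number of iterations, and `.detach()` blocks gradient flow into the language-model input. All module choices and sizes here are placeholders, not the actual ABINet++ architecture.

```python
import torch
import torch.nn as nn

class IterativeSpeller(nn.Module):
    """Toy sketch: decoupled vision/language prediction with iterative
    correction and gradient blocking between the two parts."""

    def __init__(self, vocab=37, feat=256, max_len=25, iters=3):
        super().__init__()
        self.iters = iters
        self.vision_head = nn.Linear(feat, vocab)            # stands in for the vision model
        self.language_model = nn.GRU(vocab, vocab, batch_first=True)

    def forward(self, vis_feat):                  # vis_feat: (B, max_len, feat)
        logits = self.vision_head(vis_feat)       # initial, purely visual prediction
        for _ in range(self.iters):
            probs = logits.softmax(-1).detach()   # block gradients into the LM input
            refined, _ = self.language_model(probs)
            logits = logits + refined             # fuse visual and linguistic evidence
        return logits

x = torch.randn(2, 25, 256)
print(IterativeSpeller()(x).shape)                # torch.Size([2, 25, 37])
```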
We launch EVA, a vision-centric foundation model that explores the limits of visual representation at scale using only publicly accessible data. EVA is a vanilla ViT pre-trained to reconstruct masked-out, image-text-aligned vision features conditioned on visible image patches. Via this pretext task, we can efficiently scale EVA up to one billion parameters and set new records on a broad range of representative vision downstream tasks, such as image recognition, video action recognition, object detection, instance segmentation, and semantic segmentation, without heavy supervised training. Moreover, we observe that quantitative changes in scaling EVA result in qualitative changes in transfer learning performance that are not present in other models. For instance, EVA takes a great leap on the challenging large-vocabulary instance segmentation task: our model achieves almost the same state-of-the-art performance on the LVISv1.0 dataset, with over a thousand categories, as on the COCO dataset, with only eighty categories. Beyond a pure vision encoder, EVA can also serve as a vision-centric, multi-modal pivot to connect images and text. We find that initializing the vision tower of a giant CLIP from EVA greatly stabilizes training and outperforms the from-scratch counterpart with far fewer samples and less compute, providing a new direction for scaling up and accelerating the costly training of multi-modal foundation models. To facilitate future research, we release all the code and models at https://github.com/baaivision/EVA.
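A minimal sketch of the pretext task as the abstract describes it: regress frozen image-text-aligned features (e.g., from a CLIP vision tower) at masked patch positions, conditioned on the visible patches. The tiny encoder, dimensions, and the negative-cosine loss below are my assumptions, not the actual EVA model or training objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedFeatureRegressor(nn.Module):
    """Toy masked-feature-regression model: masked patches are replaced by a
    learnable mask token, and a small Transformer predicts target features."""

    def __init__(self, dim=256, target_dim=512, depth=2):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, target_dim)

    def forward(self, patches, mask):             # patches: (B, N, dim); mask: (B, N) bool
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(patches), patches)
        return self.head(self.encoder(x))         # predicted features, (B, N, target_dim)

def pretext_loss(pred, clip_feats, mask):
    """Negative cosine similarity on masked positions only (a common choice
    for feature-regression targets; the exact loss here is an assumption)."""
    pred = F.normalize(pred[mask], dim=-1)
    tgt = F.normalize(clip_feats[mask], dim=-1)
    return -(pred * tgt).sum(-1).mean()

B, N = 2, 196
patches = torch.randn(B, N, 256)
clip_feats = torch.randn(B, N, 512)               # stand-in for frozen CLIP features
mask = torch.rand(B, N) < 0.4
print(pretext_loss(MaskedFeatureRegressor()(patches, mask), clip_feats, mask))
```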
Health sensing for chronic disease management creates immense benefits for social welfare. Existing health sensing studies primarily focus on the prediction of physical chronic diseases. Depression, a widespread complication of chronic diseases, remains understudied. We draw on the medical literature to support depression prediction using motion sensor data. To incorporate human expertise into decision-making, safeguard trust in this high-stakes prediction, and ensure algorithmic transparency, we develop an interpretable deep learning model: the Temporal Prototype Network (TempPNet). TempPNet is built upon emergent prototype learning models. To accommodate the temporal characteristics of sensor data and the progressive nature of depression, TempPNet differs from existing prototype learning models in its ability to capture the temporal progression of depression. Extensive empirical analyses using real-world motion sensor data show that TempPNet outperforms state-of-the-art benchmarks in depression prediction. Moreover, TempPNet interprets its predictions by visualizing the temporal progression of depression and the corresponding symptoms detected from sensor data. We further conduct a user study to demonstrate its superiority over the benchmarks in interpretability. This study offers an algorithmic solution for impactful social good: collaborative care of chronic diseases and depression in health sensing. Methodologically, it contributes to the extant literature a novel interpretable deep learning model for depression prediction from sensor data. Patients, doctors, and caregivers can deploy our model on mobile devices to monitor patients' depression risks in real time. Our model's interpretability also allows human experts to participate in decision-making by reviewing the interpretation of prediction outcomes and making informed interventions.
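To give a feel for temporal prototype matching, here is a toy PyTorch sketch: a sequence encoder embeds each time step of a sensor stream, similarity to a set of learned prototypes is computed per step, and the classifier sees how those similarities evolve over time. This is my own simplification for illustration, not the published TempPNet architecture.

```python
import torch
import torch.nn as nn

class TemporalPrototypeSketch(nn.Module):
    """Toy temporal-prototype model: per-step prototype similarities double
    as an interpretable trace of how the matched pattern progresses."""

    def __init__(self, in_dim=3, hid=64, n_protos=6, n_classes=2):
        super().__init__()
        self.encoder = nn.GRU(in_dim, hid, batch_first=True)
        self.prototypes = nn.Parameter(torch.randn(n_protos, hid))
        self.classifier = nn.Linear(n_protos, n_classes)

    def forward(self, x):                        # x: (B, T, in_dim) motion-sensor windows
        h, _ = self.encoder(x)                   # (B, T, hid)
        # Negative squared distance to each prototype at every time step.
        protos = self.prototypes.unsqueeze(0).expand(x.size(0), -1, -1)
        sim = -torch.cdist(h, protos) ** 2       # (B, T, n_protos)
        evidence = sim.max(dim=1).values         # strongest match over time per prototype
        return self.classifier(evidence), sim    # logits + per-step similarities to visualize

logits, sim = TemporalPrototypeSketch()(torch.randn(4, 120, 3))
print(logits.shape, sim.shape)                   # (4, 2) (4, 120, 6)
```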
In this paper, we propose an accurate yet fast small-object detection method for remote sensing images (RSI), named SuperYOLO, which fuses multimodal data and performs high-resolution (HR) object detection on small objects by exploiting auxiliary super-resolution (SR) learning, while accounting for both detection accuracy and computational cost. First, we build a compact baseline by removing the Focus module, which preserves HR features and significantly mitigates the errors from missed small objects. Second, we exploit pixel-level multimodal fusion (MF) to extract information from various data sources and to produce more suitable and effective features for small objects in RSI. Furthermore, we design a simple and flexible SR branch to learn HR feature representations that can distinguish small objects from vast backgrounds with only low-resolution (LR) input, thereby further improving detection accuracy. Moreover, to avoid introducing additional computation, the SR branch is discarded in the inference stage, and the computation of the network model is reduced thanks to the LR input. Experimental results show that, on the widely used VEDAI RS dataset, SuperYOLO achieves an accuracy of 73.61% (in terms of mAP50), more than 10% higher than SOTA large models such as YOLOv5l, YOLOv5x, and the RS-designed YOLOrs. Meanwhile, the GFLOPs and parameter size of SuperYOLO are about 18.1x and 4.2x smaller than those of YOLOv5x. Our proposed model shows a favorable accuracy-speed trade-off compared with state-of-the-art models. The code will be open-sourced at https://github.com/icey-zhang/superyolo.
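The sketch below illustrates the "auxiliary SR branch that is dropped at inference" idea from the abstract: a shared backbone feeds both a detection head and a super-resolution head, the SR output is supervised only during training, and at test time only the cheap detection path runs. The layers, channel counts, and the placeholder loss are my own toy choices, not the actual SuperYOLO network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AuxSRDetectorSketch(nn.Module):
    """Toy detector with a training-only super-resolution branch."""

    def __init__(self, in_ch=4, feat=32, n_out=16):
        super().__init__()
        self.backbone = nn.Sequential(                # shared features (fused RGB+IR input)
            nn.Conv2d(in_ch, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU())
        self.det_head = nn.Conv2d(feat, n_out, 1)     # stand-in detection head
        self.sr_head = nn.Sequential(                 # auxiliary super-resolution branch
            nn.Conv2d(feat, 3 * 4, 3, padding=1), nn.PixelShuffle(2))

    def forward(self, x):
        f = self.backbone(x)
        det = self.det_head(f)
        if self.training:
            return det, self.sr_head(f)               # SR output supervises training only
        return det                                    # inference: SR branch is skipped

model = AuxSRDetectorSketch().train()
lr_rgbir = torch.randn(2, 4, 64, 64)                  # fused low-resolution multimodal input
hr_rgb = torch.randn(2, 3, 128, 128)                  # HR target for the SR loss
det, sr = model(lr_rgbir)
loss = det.mean() + F.l1_loss(sr, hr_rgb)             # detection term is a placeholder
print(det.shape, sr.shape, float(loss))
```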
Vision Transformers (ViTs) are emerging and have achieved significant accuracy improvements on computer vision tasks. However, their complex architecture and enormous computation/storage demands impose an urgent need for new hardware accelerator design methodologies. This work proposes an FPGA-aware automatic ViT acceleration framework based on the proposed mixed-scheme quantization. To the best of our knowledge, this is the first FPGA-based ViT acceleration framework to explore model quantization. Compared with state-of-the-art ViT quantization work (algorithm-only approaches without hardware acceleration), our quantization achieves 0.47% to 1.36% higher Top-1 accuracy at the same bit width. Compared with the 32-bit floating-point baseline FPGA accelerator, our accelerator achieves approximately a 5.6x improvement in frame rate (i.e., 56.8 FPS vs. 10.0 FPS) with a 0.71% accuracy drop for DeiT-base on the ImageNet dataset.
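For orientation, here is a toy Python sketch of two weight-quantization schemes that a mixed-scheme approach could combine; the abstract does not specify which schemes the paper uses or how layers are assigned to them, so both functions and the pairing are purely illustrative assumptions.

```python
import numpy as np

def quantize_fixed_point(w, bits=4):
    """Uniform symmetric fixed-point quantization."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale).clip(-qmax, qmax) * scale

def quantize_power_of_two(w, bits=4):
    """Power-of-two quantization: each weight becomes sign * 2^k, turning
    multiplications into shifts, which maps well onto FPGA logic."""
    sign = np.sign(w)
    mag = np.where(np.abs(w) > 0, np.abs(w), 1e-12)
    exp = np.clip(np.round(np.log2(mag)), -2 ** (bits - 1), 0)
    return sign * 2.0 ** exp

w = np.random.default_rng(0).normal(0, 0.2, size=8)
print("original :", np.round(w, 3))
print("fixed-pt :", np.round(quantize_fixed_point(w), 3))
print("pow-of-2 :", np.round(quantize_power_of_two(w), 3))
```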
Drones equipped with cameras can significantly enhance humans' ability to perceive the world, thanks to their remarkable maneuverability in 3D space. Ironically, object detection for drones has always been conducted in the 2D image space, which fundamentally limits their ability to understand 3D scenes. In addition, existing 3D object detection methods developed for autonomous driving cannot be directly applied, due to their lack of deformation modeling, which is essential for the distant aerial perspective with severe view deformation and small objects. To fill the gap, this work proposes a dual-view detection system named DVDET, which achieves aerial monocular object detection in both the 2D image space and the 3D physical space. To address the severe view deformation issue, we propose a trainable transformation module that can properly warp information from the drone's perspective to BEV. Compared with monocular methods for cars, our transformation includes a learnable deformable network that explicitly revises the severe deviation. To address the dataset challenge, we propose a new large-scale simulation dataset named AM3D-Sim, generated by co-simulation with AirSim and CARLA, and a new real-world aerial dataset named AM3D-Real, collected with a DJI Matrice 300 RTK; both datasets provide high-quality annotations for 3D object detection. Extensive experiments show that i) aerial monocular 3D object detection is feasible; ii) a model pre-trained on the simulation dataset benefits real-world performance; and iii) DVDET also benefits monocular 3D object detection for cars. To encourage more researchers to investigate this area, we will release the datasets and related code at https://sjtu-magic.github.io/dataset/am3d/.
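The following PyTorch sketch illustrates the general idea of warping image features into a BEV grid with a fixed sampling grid plus small learned offsets, in the spirit of the trainable deformable transformation described above. The identity base grid, the offset network, and all sizes are placeholders of mine; the actual module would use the drone's camera geometry and a more elaborate deformable design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableImageToBEV(nn.Module):
    """Toy image-to-BEV warp: a fixed base grid plus learned per-cell offsets."""

    def __init__(self, feat_ch=64, bev_h=48, bev_w=48):
        super().__init__()
        # Base grid: where each BEV cell would sample in the image under a fixed
        # camera geometry.  Here it is only a placeholder identity grid.
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, bev_h),
                                torch.linspace(-1, 1, bev_w), indexing="ij")
        self.register_buffer("base_grid", torch.stack([xs, ys], dim=-1))   # (H, W, 2)
        self.offset_net = nn.Conv2d(feat_ch, 2, 3, padding=1)              # learned corrections

    def forward(self, img_feat):                         # img_feat: (B, C, h, w)
        b = img_feat.size(0)
        grid = self.base_grid.unsqueeze(0).expand(b, -1, -1, -1)
        # Predict per-cell offsets from a coarse resampling of the features.
        coarse = F.interpolate(img_feat, size=self.base_grid.shape[:2],
                               mode="bilinear", align_corners=False)
        offsets = 0.1 * torch.tanh(self.offset_net(coarse)).permute(0, 2, 3, 1)
        return F.grid_sample(img_feat, grid + offsets, align_corners=False)  # (B, C, H, W)

bev = DeformableImageToBEV()(torch.randn(2, 64, 90, 160))
print(bev.shape)   # torch.Size([2, 64, 48, 48])
```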
In this paper, we explore a new knowledge-amalgamation problem, termed Federated Selective Aggregation (FedSA). The goal of FedSA is to train a student model for a new task with the help of several decentralized teachers, whose pre-training tasks and data are different and agnostic. Our motivation for investigating such a problem setup stems from a recent model-sharing dilemma. Many researchers and institutions have spent enormous resources on training large and competent networks, yet due to privacy, security, or intellectual-property concerns, they are unable to share their pre-trained models, even if they wish to contribute to the community. The proposed FedSA offers a solution to this dilemma and takes it one step further, since the learned student may specialize in a new task different from those of all the teachers. To this end, we propose a dedicated strategy for handling FedSA. Specifically, our student-training process is driven by a novel saliency-based approach that adaptively selects teachers as participants and integrates their representative abilities into the student. To evaluate the effectiveness of FedSA, we conduct experiments on both single-task and multi-task settings. Experimental results show that FedSA effectively amalgamates the knowledge of the decentralized models and achieves performance competitive with centralized baselines.
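A toy sketch of selective knowledge amalgamation: for every batch, each decentralized teacher receives a weight from a simple saliency proxy (here, average prediction confidence), and the student matches the weighted mixture of teacher outputs. The proxy, the KL objective, and all modules and dimensions below are my own stand-ins; the paper's saliency-based selection is more elaborate.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def amalgamation_loss(student_logits, teacher_logits_list, T=2.0):
    """Distill a saliency-weighted mixture of teacher predictions into the student."""
    # Saliency proxy: average max-probability of each teacher on this batch.
    weights = torch.stack([t.softmax(-1).max(-1).values.mean()
                           for t in teacher_logits_list])
    weights = weights / weights.sum()
    target = sum(w * (t / T).softmax(-1) for w, t in zip(weights, teacher_logits_list))
    return F.kl_div((student_logits / T).log_softmax(-1), target, reduction="batchmean")

student = nn.Linear(32, 10)
teachers = [nn.Linear(32, 10).eval() for _ in range(3)]   # stand-ins for pre-trained teachers
x = torch.randn(8, 32)
with torch.no_grad():
    t_logits = [t(x) for t in teachers]
print(amalgamation_loss(student(x), t_logits).item())
```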
Radiation encephalopathy (REP) is the most common complication of radiotherapy for nasopharyngeal carcinoma (NPC). It is highly desirable to assist clinicians in optimizing NPC radiotherapy regimens to reduce radiotherapy-induced temporal lobe injury (RTLI) according to the likelihood of REP onset. To the best of our knowledge, this is the first exploration of predicting radiotherapy-induced REP by jointly exploiting image and non-image data from NPC radiotherapy regimens. We formulate REP prediction as a survival analysis task and evaluate prediction accuracy with the concordance index (CI). We design a deep multimodal survival network (MSN) with two feature extractors to learn discriminative features from the multimodal data: one performs feature selection on the non-image data, and the other learns visual features from the images. Because the BCI loss function, which directly maximizes the CI, is sensitive to uneven sampling within each batch, we propose a novel weighted CI (WCI) loss function that exploits all REP samples effectively by assigning them distinct weights through a dual averaging operation. We further introduce a temperature hyper-parameter for WCI to sharpen the risk differences between sample pairs and aid model convergence. We extensively evaluate WCI on a private dataset to demonstrate its advantages over its counterparts. Experimental results also show that the multimodal data of NPC radiotherapy can bring further gains for REP risk prediction.
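To ground the survival-analysis framing, here is a toy pairwise surrogate for the concordance index in Python: for every comparable pair where subject i had the event earlier than subject j was observed, the predicted risk of i should exceed that of j, and a temperature scales the pairwise risk differences. This is a generic CI surrogate with hypothetical data, not the paper's exact WCI loss or its dual averaging weighting.

```python
import torch
import torch.nn.functional as F

def concordance_ranking_loss(risk, time, event, temperature=1.0):
    """Pairwise ranking surrogate for the concordance index (CI)."""
    # comparable[i, j] = True if subject i had an event and time[i] < time[j].
    comparable = (event.unsqueeze(1) == 1) & (time.unsqueeze(1) < time.unsqueeze(0))
    diff = (risk.unsqueeze(1) - risk.unsqueeze(0)) / temperature   # risk_i - risk_j
    pair_loss = F.softplus(-diff)                 # small when risk_i is well above risk_j
    return (pair_loss * comparable).sum() / comparable.sum().clamp(min=1)

risk = torch.tensor([2.1, 0.3, 1.5, -0.2])        # model-predicted REP risk scores
time = torch.tensor([12.0, 40.0, 18.0, 35.0])     # months to REP onset or censoring
event = torch.tensor([1, 0, 1, 0])                # 1 = REP observed, 0 = censored
print(concordance_ranking_loss(risk, time, event))
```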